 document-term matrix


Uncovering Key Trends in Industry 5.0 through Advanced AI Techniques

arXiv.org Artificial Intelligence

This article analyzes around 200 online articles to identify trends within Industry 5.0 using artificial intelligence techniques. Specifically, it applies algorithms such as LDA, BERTopic, LSA, and K-means, in various configurations, to extract and compare the central themes present in the literature. The results reveal a convergence around a core set of themes while also highlighting that Industry 5.0 spans a wide range of topics. The study concludes that Industry 5.0, as an evolution of Industry 4.0, is a broad concept that lacks a clear definition, making it difficult to focus on and apply effectively. Therefore, for Industry 5.0 to be useful, it needs to be refined and more clearly defined. Furthermore, the findings demonstrate that well-known AI techniques can be effectively utilized for trend identification, particularly when the available literature is extensive and the subject matter lacks precise boundaries. This study showcases the potential of AI in extracting meaningful insights from large and diverse datasets, even in cases where the thematic structure of the domain is not clearly delineated.


A Novel Two-Step Method for Cross Language Representation Learning

Neural Information Processing Systems

Cross language text classification is an important learning task in natural language processing. A critical challenge of cross language learning arises from the fact that words of different languages are in disjoint feature spaces. In this paper, we propose a two-step representation learning method to bridge the feature spaces of different languages by exploiting a set of parallel bilingual documents. Specifically, we first formulate a matrix completion problem to produce a complete parallel document-term matrix for all documents in two languages, and then induce a low dimensional cross-lingual document representation by applying latent semantic indexing on the obtained matrix. We use a projected gradient descent algorithm to solve the formulated matrix completion problem with convergence guarantees. The proposed method is evaluated by conducting a set of experiments with cross language sentiment classification tasks on Amazon product reviews. The experimental results demonstrate that the proposed learning method outperforms a number of other cross language representation learning methods, especially when the number of parallel bilingual documents is small.
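
The second step described above, latent semantic indexing on a complete document-term matrix, can be sketched as a truncated SVD. This is a minimal illustration and not the paper's implementation: it assumes NumPy is available, the function name `lsi_document_embeddings` is hypothetical, and the toy matrix merely stands in for the output of the matrix completion step, which is not reproduced here.

```python
import numpy as np

def lsi_document_embeddings(dtm, k):
    """Return k-dimensional document vectors and the k term axes of dtm."""
    X = np.asarray(dtm, dtype=float)
    # Thin SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Each row of U[:, :k] * s[:k] is one document in the latent space.
    return U[:, :k] * s[:k], Vt[:k]

# Toy stand-in for a completed parallel document-term matrix (assumption):
# columns 0-1 play the role of terms in one language, columns 2-3 the other.
toy_matrix = [
    [2, 1, 2, 1],
    [1, 2, 1, 2],
    [3, 0, 3, 0],
    [0, 3, 0, 3],
]
doc_vectors, term_axes = lsi_document_embeddings(toy_matrix, k=2)
```

Because the toy matrix has rank 2, `doc_vectors @ term_axes` reproduces it exactly; on real data a small `k` gives only an approximation, which is the point of the dimensionality reduction.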


How to Use SVD and NMF in Python

#artificialintelligence

In the context of Natural Language Processing (NLP), topic modeling is an unsupervised learning problem whose goal is to find abstract topics in a collection of documents. Topic Modeling answers the question: "Given a text corpus of many documents, can we find the abstract topics that the text is talking about?" By the end of this tutorial, you'll be able to build your own topic models to find topics in any piece of text. Let's start by understanding what topic modeling is. Suppose you're given a large text corpus containing several documents.
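
Topic models such as SVD-based LSA and NMF typically start from a document-term matrix of word counts. The sketch below builds one using only the standard library; the whitespace tokenizer, the function name `build_document_term_matrix`, and the three-sentence corpus are illustrative assumptions, not taken from the tutorial.

```python
from collections import Counter

def build_document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in doc i."""
    # Lowercase and split on whitespace: a deliberately naive tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    # Sort the vocabulary so that the column order is deterministic.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Hypothetical mini-corpus: two documents about animals, one about finance.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
]
vocabulary, dtm = build_document_term_matrix(corpus)
```

Each row of `dtm` is one document and each column one vocabulary term; a topic model would then factorize this matrix.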


A comparison of latent semantic analysis and correspondence analysis of document-term matrices

arXiv.org Artificial Intelligence

Latent semantic analysis (LSA) and correspondence analysis (CA) are two techniques that use a singular value decomposition (SVD) for dimensionality reduction. LSA has been extensively used to obtain low-dimensional representations that capture relationships among documents and terms. In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices. We show that CA has some attractive properties as compared to LSA, for instance that effects of margins, i.e. sums of row elements and column elements, arising from differing document-lengths and term-frequencies are effectively eliminated, so that the CA solution is optimally suited to focus on relationships among documents and terms. A unifying framework is proposed that includes both CA and LSA as special cases. We empirically compare CA to various LSA based methods on text categorization in English and authorship attribution on historical Dutch texts, and find that CA performs significantly better. We also apply CA to a long-standing question regarding the authorship of the Dutch national anthem Wilhelmus and provide further support that it can be attributed to the author Datheen, amongst several contenders.
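
To illustrate how margins drop out in CA, the sketch below computes row profiles and the standardized residual matrix that CA, in its standard formulation, decomposes with an SVD (LSA instead decomposes the raw or weighted document-term matrix). The function names and the toy counts are assumptions for illustration; the SVD step itself is omitted.

```python
import math

def row_profiles(counts):
    """Divide each row by its total, so that document length drops out."""
    return [[x / sum(row) for x in row] for row in counts]

def standardized_residuals(counts):
    """Return (S, r, c) with S[i][j] = (p_ij - r_i * c_j) / sqrt(r_i * c_j)."""
    total = sum(sum(row) for row in counts)
    P = [[x / total for x in row] for row in counts]
    r = [sum(row) for row in P]        # row masses (document margins)
    c = [sum(col) for col in zip(*P)]  # column masses (term margins)
    S = [
        [(P[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j]) for j in range(len(c))]
        for i in range(len(r))
    ]
    return S, r, c

# Hypothetical counts: document 1 is document 0 at twice the length.
counts = [
    [4, 2, 0],
    [8, 4, 0],
    [1, 1, 6],
]
profiles = row_profiles(counts)
S, r, c = standardized_residuals(counts)
```

Documents 0 and 1 differ only in length, so their row profiles coincide, whereas their raw count vectors, which LSA would use, do not.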


Learning Neural Networks on SVD Boosted Latent Spaces for Semantic Classification

arXiv.org Machine Learning

The availability of large amounts of data and compelling computational power has made deep learning models very popular for text classification and sentiment analysis. Deep neural networks have achieved competitive performance on the above tasks when trained on naive text representations such as word count, term frequency, and binary matrix embeddings. However, many of the above representations result in the input space having a dimension of the order of the vocabulary size, which is enormous. This leads to a blow-up in the number of parameters to be learned, and the computational cost becomes infeasible when scaling to domains that require retaining a colossal vocabulary. This work proposes using singular value decomposition to transform the high-dimensional input space to a lower-dimensional latent space. We show that neural networks trained on this lower-dimensional space not only retain performance while achieving a significant reduction in computational complexity but, in many situations, also outperform the classical neural networks trained on the native input space.


How Stuff Works: A Comprehensive Topic Modelling Guide with NMF, LSA, PLSA, LDA & lda2vec (Part-1)

#artificialintelligence

This article is a comprehensive overview of Topic Modeling and its associated techniques. This is the first part of the article and will cover NMF, LSA and PLSA only. LDA and lda2vec will be covered in the next part. In natural language understanding (NLU) tasks, there is a hierarchy of lenses through which we can extract meaning -- from words to sentences to paragraphs to documents. At the document level, one of the most useful ways to understand text is by analyzing its topics.


Going deep in clustering high-dimensional data: deep mixtures of unigrams for uncovering topics in textual data

arXiv.org Machine Learning

They can be basically defined as a multi-layer stack of algorithms or modules able to gradually learn a huge number of parameters in an architecture composed of multiple nonlinear transformations (LeCun et al., 2015). Typically, and for historical reasons, a structure for deep learning is identified with advanced neural networks: deep Feed Forward, Recurrent, Auto-encoder, and Convolutional neural networks are very effective and widely used deep learning algorithms (Schmidhuber, 2015). They have proved particularly successful in supervised classification problems arising in several fields such as image and speech recognition, gene expression data, and topic classification. When the aim is uncovering unknown classes in an unsupervised classification perspective, important methods of deep learning have been developed along the lines of mixture modeling, because of their ability to decompose a heterogeneous collection of units into a finite number of subgroups with homogeneous structures (Fraley and Raftery, 2002; McLachlan and Peel, 2000). In this direction, van den Oord and Schrauwen (2014) proposed Multilayer Gaussian Mixture Models for modeling natural images; Tang et al. (2012) defined deep mixtures of factor analyzers with a greedy layer-wise learning algorithm able to learn one layer at a time. Viroli and McLachlan (2019) developed a general framework for Deep Gaussian mixture models that generalizes and encompasses the previous strategies and several flexible model-based clustering methods such as mixtures of mixture models (Li, 2005), mixtures of Factor Analyzers (McLachlan et al., 2003), mixtures of factor analyzers with common factor loadings (Baek et al., 2010), heteroscedastic factor mixture analysis (Montanari and Viroli, 2010) and mixtures of factor mixture analyzers introduced by Viroli (2010).
A general 'take-home message' coming from the existing deep clustering strategies is that deep methods, compared with shallow ones, appear to be very efficient and powerful tools, especially for complex high-dimensional data; by contrast, for simple and small data structures, a deep learning strategy cannot improve on the performance of simpler, conventional methods or, to put it better, it is like using a 'sledgehammer to crack a nut'. The motivating problem behind this work derives from ticket data (i.e.


How to easily do Topic Modeling with LSA, pLSA, LDA & lda2Vec

#artificialintelligence

This article is a comprehensive overview of Topic Modeling and its associated techniques. In natural language understanding (NLU) tasks, there is a hierarchy of lenses through which we can extract meaning -- from words to sentences to paragraphs to documents. At the document level, one of the most useful ways to understand text is by analyzing its topics. The process of learning, recognizing, and extracting these topics across a collection of documents is called topic modeling. In this post, we will explore topic modeling through 4 of the most popular techniques today: LSA, pLSA, LDA, and the newer, deep learning-based lda2vec.


A Comparison of Machine Learning Algorithms for the Surveillance of Autism Spectrum Disorder

arXiv.org Machine Learning

The Centers for Disease Control and Prevention (CDC) coordinates a labor-intensive process to measure the prevalence of autism spectrum disorder (ASD) among children in the United States. Random forest methods have shown promise in speeding up this process, but they lag behind human classification accuracy by about 5 percent. We explore whether newer document classification algorithms can close this gap. We applied 6 supervised learning algorithms to predict whether children meet the case definition for ASD based solely on the words in their evaluations. We compared the algorithms' performance across 10 random train-test splits of the data, and then we combined our top 3 classifiers to estimate the Bayes error rate in the data. Across the 10 train-test cycles, the random forest, neural network, and support vector machine with Naive Bayes features (NB-SVM) each achieved slightly more than 86.5 percent mean accuracy. The Bayes error rate is estimated at approximately 12 percent, meaning that the model error for even the simplest of our algorithms, the random forest, is below 2 percent. NB-SVM produced significantly more false positives than false negatives. The random forest performed as well as newer models like the NB-SVM and the neural network. NB-SVM may not be a good candidate for use in a fully-automated surveillance workflow due to increased false positives. More sophisticated algorithms, like hierarchical convolutional neural networks, would not perform substantially better due to characteristics of the data. Deep learning models performed similarly to traditional machine learning methods at predicting the clinician-assigned case status for CDC's autism surveillance system. While deep learning methods had limited benefit in this task, they may have applications in other surveillance systems.
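
The "below 2 percent" figure follows from simple arithmetic if total error is read as irreducible (Bayes) error plus model error. The sketch below works through that reading using the rounded figures quoted above; the decomposition and the function name `model_error` are assumptions made for illustration, not code from the paper.

```python
def model_error(mean_accuracy, bayes_error_rate):
    """Error attributable to the model: total error minus the Bayes error."""
    total_error = 1.0 - mean_accuracy
    return total_error - bayes_error_rate

# Figures quoted in the abstract, rounded: 86.5 percent accuracy, 12 percent Bayes error.
reported_accuracy = 0.865
reported_bayes_error = 0.12
excess = model_error(reported_accuracy, reported_bayes_error)
```

With these inputs the total error is 13.5 percent and the excess over the Bayes error is about 1.5 percentage points; since the reported accuracy is slightly above 86.5 percent, the model error stays under the 2 percent bound.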


Text Processing in R

@machinelearnbot

This tutorial goes over some basic concepts and commands for text processing in R. R is not the only way to process text, nor is it always the best way. Python is the de-facto programming language for processing text, with a lot of built-in functionality that makes it easy to use, and pretty fast, as well as a number of very mature and full featured packages such as NLTK and textblob. Basic shell scripting can also be many orders of magnitude faster for processing extremely large text corpora -- for a classic reference see Unix for Poets. Yet there are good reasons to want to use R for text processing, namely that we can do it, and that we can fit it in with the rest of our analyses. Furthermore, there is a lot of very active development going on in the R text analysis community right now (see especially the quanteda package).